Nature Machine Intelligence — Latest Matching Preprints

1

GeoEPred: A Multimodal Structure-Aware Geometric Deep Learning Framework for Gram-Negative Bacterial Secreted Effector Prediction with Sequence Semantics

Song, S.; Shi, H.; Wu, H.; Liu, D.; Lin, Y.; Mat Isa, N. A.; Zou, Q.; Wei, L.

2026-05-20 genomics 10.64898/2026.05.18.725929 medRxiv

Top 0.1%

54.5%

Accurate prediction of effector proteins secreted by Gram-negative bacteria is important for elucidating bacterial pathogenic mechanisms and developing precise anti-infective strategies. Although existing methods have benefited from the strong sequence feature extraction capacity of pretrained protein language models, reliance on linear sequence information alone often fails to fully capture the three-dimensional conformational signals required for virulence functions. Meanwhile, conventional structure-based methods are limited by the scarcity of experimentally resolved protein structures. To address these challenges, we propose GeoEPred, a multimodal deep learning framework designed for the synergistic modeling of protein sequence and structure to identify Gram-negative bacterial effector proteins. Specifically, the model integrates sequence-contextual embeddings from a pretrained protein language model with three-dimensional structural representations predicted by ESMFold. A feature projection network refines fine-grained sequence signals associated with effector functions, while geometric vector perceptrons characterize inter-residue orientations, distances, and local spatial topology to capture potential structural conformational motifs. To further enable effective cross-modal fusion, we design a cross-modal alignment and feature-tokenized self-attention module. This module enhances consistency between the sequence-semantic and structural-geometric spaces through contrastive learning and models associations between linear functional motifs and spatial conformational patterns at a fine-grained token level. Extensive evaluations on multiple benchmark datasets show that GeoEPred achieves better predictive performance than existing leading models in T3SE, T4SE, and T6SE prediction tasks, while maintaining stable performance in remote homolog recognition scenarios.
Moreover, the modular and extensible architecture of GeoEPred demonstrates strong generalization ability and substantial application potential for genome-scale effector protein discovery. Author summary: Secreted effector proteins are central virulence factors used by many Gram-negative bacterial pathogens to execute infection strategies. Their functions are governed not only by secretion signals and short linear motifs in the amino acid sequence, but also by three-dimensional folds, local domains, and surface geometric patterns. However, current predictors mainly exploit sequence-contextual features, limiting their ability to model the correspondence between linear sequence signals and spatial conformational motifs, and thereby constraining accuracy and interpretability. Here, we present GeoEPred, a multimodal deep learning framework for secreted effector protein identification. GeoEPred couples sequence-semantic embeddings from a pretrained protein language model with structural representations learned by geometric vector perceptrons. A cross-modal alignment and interaction module uses contrastive learning to improve functional consistency between sequence and structure modalities, while feature-token attention captures fine-grained links between key linear and conformational motifs. Across benchmark datasets covering multiple effector types, GeoEPred outperforms existing state-of-the-art methods and provides interpretable evidence from sequence fragments, structural regions, and cross-modal associations, supporting functional annotation, pathogenic mechanism analysis, and experimental validation.

2

Generating antimicrobial peptides via genomic transfer learning

Polloni, L.; Bieniasz, K. D.; Gonteri, I.; Frost, J. M.

2026-06-20 pharmacology and toxicology 10.64898/2026.06.16.732639 medRxiv

Top 0.1%

39.8%

We present a generative machine learning pipeline for the design of linear antimicrobial peptides (AMPs). To extend diversity beyond synthetically validated peptide datasets (~7,000 entries), we apply transfer learning by training a Generative Pre-trained Transformer (GPT) on the genomically derived AMPSphere dataset (~863,000 entries), before fine-tuning on the Database of Antimicrobial Activity and Structure of Peptides (DBAASP). We assess the filtered sequences with a committee of Minimum Inhibitory Concentration (MIC) predictive models built with a Bi-LSTM architecture, and ESM-2 and QSAR feature vectors. The fine-tuned GPT model produced a 28% reduction in test loss compared to training on DBAASP alone, and generates peptides that are simultaneously more novel and more physicochemically plausible. Our top-ranked candidates are predicted to possess antimicrobial activity comparable to polymyxin B. We anticipate this transfer-learning approach is broadly applicable for leveraging massive, unlabelled genomic datasets to enrich targeted peptide discovery. Our identified sequences have been submitted to the 2027 AMP Challenge [1] (team name VINCI) for experimental validation, and the complete codebase and workflow are open source [2].

3

RulePep: Interpretable ESM-Guided Neural-Symbolic Peptide Classification

Midjani, F.; Ghelich, R.; Keshtkar, F. Z.; Malekpour, M.; Lee, H.

2026-07-06 bioinformatics 10.64898/2026.07.03.736448 medRxiv

Top 0.1%

31.5%

Peptides are increasingly explored as therapeutic candidates, delivery vectors, and functional biomolecules, but experimental screening of peptide activity and safety remains costly because the sequence space is vast and small sequence changes can alter functionality. Computational peptide classification can therefore help prioritize candidates. However, many protein-language-model-based classifiers achieve strong performance using opaque prediction heads, making it difficult to determine which learned evidence supports or opposes a prediction. We present RulePep, an ESM-2-guided neural-symbolic classifier for peptide-function prediction. RulePep maps frozen ESM-2 sequence representation to learned latent predicates, polarity-constrained differentiable rules, and an additive symbolic logit whose components can be inspected at the case level. We evaluate RulePep on three biologically distinct peptide classification tasks: blood-brain barrier penetration, hemolytic potency, and anticancer activity. On the BBPpredict, HemoPI3, and AntiCP 2.0 alternate benchmark datasets, RulePep achieved AUROC/MCC values of 0.8869/0.6850, 0.9155/0.6820, and 0.9765/0.8633, respectively. Ablation experiments supported the contributions of multi-layer representation pooling, rule polarity, mined-rule initialization, symbolic capacity, and rule-derived aggregation. RulePep combines competitive predictive performance with additive logit reconstruction, rule-level evidence reporting, and predicate-suppression auditing, providing a transparent sequence-based framework for peptide candidate prioritization.

4

Integrative Transfer Network: Deep Transfer Learning Across Populations and Prediction Targets

Gao, Y.; Cui, Y.

2026-06-16 bioinformatics 10.64898/2026.06.12.731936 medRxiv

Top 0.1%

31.1%

Large-scale clinical and biomedical datasets increasingly contain both diverse subgroup attributes (e.g., demographic or clinical subgroups) and multiple prediction targets. Although various machine learning approaches can address subgroup differences or multi-target prediction, they often consider these aspects independently rather than jointly. To more effectively capture the shared and subgroup-specific information in such complex datasets, we propose the Integrative Transfer Network (ITN), a deep neural network designed to leverage data across subgroups and multiple related outcomes simultaneously. In extensive experiments, including time-to-event and classification tasks where demographic subgroups and multiple disease endpoints are prevalent, ITN demonstrates consistent improvements in subgroup-specific prediction by borrowing strength from other subgroups and outcomes. We envision ITN as a unified framework for learning from heterogeneous datasets where subgroup-specific insights are critical.

5

Sequence-Based Therapeutic Peptide Classification with Augmented Negative Sampling

Ellerbrock, R.; Valentini, A.; Paul, A. C.; Mukhopadhyay, S.; Perelshtein, M. R.

2026-06-11 bioinformatics 10.64898/2026.06.07.730473 medRxiv

Top 0.1%

26.8%

Therapeutic peptides offer high target specificity, low toxicity, and the ability to modulate protein-protein interactions, yet experimental functional characterization remains costly and slow. Computational prediction of therapeutic function directly from sequence could accelerate peptide screening and enable generative design pipelines, but requires reliable discrimination between therapeutic and non-therapeutic peptides. Existing multi-label predictors cover few functions, rely on limited datasets, and exhibit high False Positive Rates (FPRs), limiting their practical utility. We present a lightweight CNN classifier trained on the most comprehensive therapeutic peptide database to date (54,655 peptides, 48 functional categories). A key contribution is a statistically motivated negative sampling strategy using Markov models to generate diverse synthetic decoys at multiple difficulty levels. When evaluated on this controlled decoy benchmark, the FPR is reduced from over 60% for previous models to 2.1% for our approach. On positive therapeutic samples, our fine-tuned five-model ensemble achieves 79.9% Micro F1 and 54.6% Macro F1 while requiring only amino acid sequences as inputs. Analysis using a sparse L1-constrained variant of our model shows that convolutional filters capture conserved functional motifs and statistically improbable non-therapeutic patterns, with downstream layers combining these signals, providing mechanistic evidence that the network learns biologically meaningful structure. On an external generalization benchmark derived from TPpred-LE, our model achieves 55.3% Micro F1 and 38.6% Macro F1 on the 12 shared labels, close to the benchmark-specific baseline (57.9%/38.1%), while retaining substantially broader therapeutic label coverage. Code and models will be made available at https://github.com/terra-quantum-public/tq-therapep-ai.
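
The abstract mentions Markov-model decoys at multiple difficulty levels without giving details. The following is a minimal Python sketch of one possible reading, not the authors' implementation: it assumes that the Markov order controls difficulty (order 0 preserves only residue composition, order 1 also preserves observed residue-to-residue transitions). The function names, the `^` padding symbol, and the toy peptides are all hypothetical.

```python
import random
from collections import defaultdict

START = "^"  # hypothetical padding symbol marking the start of a sequence

def fit_markov(sequences, order=1):
    """Record which residue follows each length-`order` context."""
    table = defaultdict(list)
    for seq in sequences:
        padded = START * order + seq
        for i in range(order, len(padded)):
            table[padded[i - order:i]].append(padded[i])
    return table

def sample_decoy(table, length, order=1, seed=0):
    """Sample one synthetic decoy of the requested length."""
    start = START * order
    if start not in table:
        raise ValueError("model has no start context; fit it with the same order")
    rng = random.Random(seed)
    context, residues = start, []
    while len(residues) < length:
        choices = table.get(context)
        if not choices:
            # Dead-end context (seen only at a sequence end): restart.
            context = start
            continue
        nxt = rng.choice(choices)
        residues.append(nxt)
        # Slide the context window: append one residue, drop the oldest symbol.
        context = (context + nxt)[1:]
    return "".join(residues)

# Toy training peptides, invented for illustration only.
toy_peptides = ["GLFDIIKKIAESF", "KWKLFKKIEKVGQ", "GIGKFLHSAKKFG"]
easy_decoy = sample_decoy(fit_markov(toy_peptides, order=0), 12, order=0)
hard_decoy = sample_decoy(fit_markov(toy_peptides, order=1), 12, order=1)
```

Under this assumption, higher-order decoys look more like real peptides locally and are therefore harder negatives for a classifier to reject.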

6

Receptor-Anchored Olfaction Representation through Perception-Consistent Metric Learning

Tian, C.; Wang, J.; Hou, J.; Liu, W.; Luo, Y.; Wang, Y.; Yang, L.; Lin, W.

2026-05-12 bioinformatics 10.64898/2026.05.08.723701 medRxiv

Top 0.1%

26.2%

Olfactory perception arises from distributed activation across hundreds of olfactory receptors (ORs), yet our understanding of this landscape remains constrained by the scarcity of OR affinity measurements. Here, we present Receptor-Anchored Metric Supervision (RAMS), a transfer learning framework using perceptual consistency as weak supervision to predict OR activation spectra. RAMS fine-tunes a pretrained drug-target affinity model by imposing constraints derived from olfactory perception, where similar odorants are encouraged to exhibit similar OR activations. It transfers protein-ligand interaction knowledge learned from large-scale pharmacological data into the olfactory domain and reshapes it toward OR activation prediction. Evaluations against experimental measurements show that RAMS improves the accuracy of receptor-spectrum prediction and yields biologically plausible activation patterns. The predicted spectra show concordance between receptor discriminative capacity and expression level, and highlight the understudied OR52 family as a potential contributor to primary odor recognition. Together, RAMS provides a scalable framework for reconstructing receptor-anchored olfactory representations.

7

Information Bottleneck Dominates Adversarial Training for Ancestry-Invariant Polygenic Risk Prediction: Dimensionality, Not Gradient Reversal, Controls the Fairness-Accuracy Tradeoff

Tran, P. P.; Do, A. T.

2026-04-29 genomics 10.64898/2026.04.24.720752 medRxiv

Top 0.1%

26.0%

In adversarial representation learning for fair prediction, the gradient reversal coefficient (λ) is widely treated as the primary control for sensitive-attribute invariance. We show this assumption is wrong. Using a dual-stream architecture for cross-ancestry polygenic risk score (PRS) prediction, we demonstrate that latent dimensionality -- the information bottleneck -- accounts for 8-27× more variance in ancestry leakage than adversarial strength. Varying λ across a 20× range changes leakage by only 2.2 percentage points; varying dimensionality across a 16× range changes it by 46.6 pp. At dimension 8 with no adversarial training (λ = 0), ancestry leakage is 32.9% (chance = 20%): the bottleneck alone achieves near-invariance. The adversary architecture (linear vs deep MLP) is equally irrelevant (0.6 pp range). We validate this finding across two unrelated domains -- genomic ancestry invariance (6 clinical traits, 1000 Genomes, n = 2,504) and EEG subject invariance (pretrained HFTP + Braindecode dual-domain model, 20 subjects) -- observing consistent dimensionality dominance (12.7:1 ratio in EEG). For the genomic application, Stream 1 encodes population structure via DCT-II frequency-domain features (136 coefficients); Stream 2 encodes phenotype signal from top PRS SNPs (PCA to 128 dimensions). The architecture works equally well with standard genomic PCA as the ancestry stream (R² = 0.217 vs 0.222), confirming the contribution is architectural, not encoding-specific. African-ancestry PRS reconstruction R² improves on all six traits (e.g., +5.1 pp for coronary artery disease). Linear models achieve higher aggregate R² but fail catastrophically on cross-ancestry transfer (R² = -12.45 for African-ancestry CAD). We emphasize that we predict PRS (a computed score), not disease phenotypes; validation on biobank-scale phenotype data is ongoing.
These results suggest the adversarial fairness community has been over-investing in adversary engineering relative to simple capacity control. Practitioners should select latent dimensionality first to set the information budget for the fairness-accuracy tradeoff, then optionally use adversarial training for marginal refinement.

8

Vibe Coding Specificity Foundation Models

Reddy, S. T.

2026-06-04 synthetic biology 10.64898/2026.06.04.730134 medRxiv

Top 0.1%

25.9%

Molecular recognition -- the determination of which agent binds which target -- governs adaptive immunity, gene regulation, signal transduction, RNA silencing, enzyme catalysis, and the selectivity of therapeutics. Determining binding specificity remains dependent on experimental screening or domain-specific computational tools that do not generalize across binding modalities. Transformer softmax attention is mathematically identical to the Boltzmann distribution governing molecular binding [1]. This identity, together with five conditions of molecular recognition systems, prescribes a single neural network architecture for cross-modal binding prediction: dual sequence encoders, symmetric contrastive learning, and a learned physical temperature [2]. A Specificity Foundation Model (SFM) is an instance of this physics-derived, sequence-to-sequence architecture that maps any agent-target sequence pair to a binding compatibility score, enabling bidirectional retrieval across molecular recognition domains without requiring structural information. The first SFM for antibody-antigen binding demonstrated ~100,000-fold greater data efficiency than comparable vision-language models [3]. Here we report six SFMs across six molecular recognition domains -- transcription factor-DNA, enzyme-substrate, peptide-MHC, CRISPR gRNA-off-target genomic DNA, microRNA-mRNA target, and small molecule drug-target protein -- using the identical architecture without modification and trained using publicly available data only. Evaluated by cross-modal retrieval from pools of 512 candidates (random baseline 0.2%), in-distribution R@1 ranges from 27.7% to 98.0% across the six domains. mir-SFM retrieves miRNA targets at 98.0% R@1, including the ~80% of validated interactions that seed-matching tools cannot find. mhcSFM achieves 95.4% R@1 on held-out rare HLA alleles absent from training.
Applying crisprSFM to CRISPR off-target prediction improves precision to 94.0% compared to 33.2% from Hamming distance alone. All six SFMs were built by a domain expert with no programming experience using vibe coding -- natural-language-directed AI coding agents -- with numerical claims independently verified by an orthogonal AI auditor. These results establish SFMs as a physics-derived, sequence-native class of model that augments experimental and computational workflows across molecular recognition domains.
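
The correspondence between softmax and the Boltzmann distribution invoked in this abstract can be illustrated with a minimal Python sketch. It assumes the standard mapping in which a similarity score plays the role of a negative energy and the softmax temperature plays the role of kT. The function names and toy scores are hypothetical and are not taken from the preprint. The last line checks the quoted random baseline for retrieval from a pool of 512 candidates.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax over similarity scores with a temperature parameter."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann(energies, kT=1.0):
    """Boltzmann occupancy p_i = exp(-E_i / kT) / Z."""
    weights = [math.exp(-e / kT) for e in energies]
    z = sum(weights)  # partition function
    return [w / z for w in weights]

# Toy agent-target scores, invented for illustration only.
scores = [2.0, 0.5, -1.0]
energies = [-s for s in scores]  # assumption: energy = negative score

p_softmax = softmax(scores, temperature=0.7)
p_boltzmann = boltzmann(energies, kT=0.7)

# Chance-level recall@1 when exactly one of 512 candidates is correct.
random_recall_at_1 = 1 / 512
```

With energies defined as negated scores and kT equal to the softmax temperature, the two distributions coincide up to floating-point error, and 1/512 is about 0.2%.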

9

Finetuning masking challenges narrow-task evaluation of cell foundation models

Shakeel, M. H.; Shen, M.; Mangiola, S.

2026-06-06 bioinformatics 10.64898/2026.06.04.730272 medRxiv

Top 0.1%

23.0%

Single-cell foundation models are large, self-supervised deep learning networks pretrained on millions of cellular transcriptomes. These models promise to deliver cell representations that are transferable across diverse biological domains and, when used in specific tasks, would outperform narrowly scoped models. A central assumption is that more pretraining data translates to better downstream performance. However, despite its centrality, this assumption remains largely untested. Here, we tested downstream performance on gold-standard benchmarking tasks across massive dataset reductions, showing that performance was largely insensitive to pretraining data size once finetuning was allowed. This trend reveals a finetuning masking effect that offsets differences in representation quality induced by pretraining, making the benefit of additional pretraining scale largely invisible under current benchmark settings. These findings challenge current benchmarking standards, which rely on closed-ended finetuning tasks that are too narrow to expose the full representational value of pretraining. They also challenge the main driving force in single-cell foundation-model development when evaluated through common narrow tasks. We propose that the next generation of foundation models should be assessed less by performance on highly optimised finetuning tasks and more by their ability to support open-ended biological inference, frozen-representation evaluation and zero-shot capability.

10

Species- and Topic-aware Representation Learning for Antimicrobial Peptide Discovery

Padi, S.; Mondal, K.; Kaur, N.; Hoogerheide, D. P.; Heinrich, F.; Mihailescu, E.; Klauda, J. B.; Cardone, A.; Keyrouz, W.

2026-06-01 bioinformatics 10.64898/2026.05.28.728246 medRxiv

Top 0.1%

22.2%

Antimicrobial resistance poses a major global health challenge, necessitating efficient strategies to discover potent antimicrobial peptides (AMPs). While recent generative models can produce many candidate sequences, experimentally validating all generated peptides in wet labs is impractical due to the high costs and time involved in such measurements. As a result, there is a strong demand for accurate predictions of peptide efficacy, typically measured as the minimum inhibitory concentration (MIC). We introduce STAMP, a framework for Species- and Topic-aware Representation Learning in AMP Discovery. This unified machine learning framework allows for cross-species predictions of AMP activity. STAMP integrates protein language model embeddings with species conditioning and topic-aware representations that capture sequence-level patterns, enabling generalizable predictions across multiple bacterial species within a single model. We evaluated STAMP on three benchmark datasets, which include two previously published datasets and a newly curated dataset derived from DBAASP, addressing duplicates and inconsistencies systematically. STAMP achieved strong predictive performance across these datasets, demonstrating a Pearson correlation coefficient (PCC) of 0.837 and an R² of 0.70, outperforming several baseline models. Importantly, we further validated our prediction model using peptides that were experimentally tested for their antimicrobial activity against E. coli and S. epidermidis bacteria, demonstrating its real-world applicability. Furthermore, residue-level importance analyses provide insights into the sequence determinants governing antimicrobial activity. Together, these results establish STAMP as a scalable framework for MIC prediction and an effective computational tool for accelerating AMP discovery and optimization.

11

Bio-BLIP: A Multimodal Architecture for Transferable Reasoning in Genomic Variant Interpretation

Gupta, A.; Buendia, A.; Kundaje, A.; Leskovec, J.

2026-05-15 genomics 10.64898/2026.05.12.724740 medRxiv

Top 0.1%

22.2%

Developing scientific hypotheses in biology requires integrating heterogeneous evidence across DNA sequence, gene context, protein function, and prior literature. Existing multimodal AI systems expose biological evidence to reasoning models through textification or by projecting biological embeddings into fine-tuned language models. However, these models are typically highly optimized for the specific set of tasks for which they are fine-tuned. Here we present Bio-BLIP, a multimodal Q-former based architecture which leverages biological embeddings and an LLM to generalize to complex reasoning tasks without task-specific fine-tuning. The key to Bio-BLIP is a new neural network architecture that integrates four data modalities - DNA, genes, proteins, and text - through a master Q-former model, which integrates the modality-specific information into a fixed-length prefix for the LLM backbone. Bio-BLIP is pretrained on the task of human genetic variant annotation and achieves a 29.8% increase in generating accurate variant features over frontier LLMs. We evaluate Bio-BLIP zero-shot on downstream genomic tasks of variant prioritization and target gene prediction. Bio-BLIP outperforms two alignment-free genomic language models on regulatory variant prioritization for Mendelian disease. Across the target gene prediction task, Bio-BLIP improves accuracy over LLMs by leveraging learned genomic variant knowledge in difficult cases. Our model produces rich, transparent reasoning traces. In biological domains characterized by multiple scales of data and varied downstream tasks, Bio-BLIP offers a step toward natively multimodal, generalizable reasoning.

12

DORA: a dose-response autoencoder for interpretable transcriptome-to-viability prediction

Wang, S.; Allauzen, A.; Opuu, V.; Nghe, P.

2026-05-28 molecular biology 10.64898/2026.05.27.728125 medRxiv

Top 0.1%

19.7%

Predicting the effect of drugs on cell viability is a central challenge in drug discovery. Artificial intelligence holds the promise to considerably accelerate this process by leveraging rich cellular data such as transcriptomics. Current models focus on either transcriptomes or inhibitory concentrations, but they fall short in integrating these sources of information. Here, we propose DORA (Dose-Response Autoencoder), a deep learning model that predicts changes in transcriptomes and viability in a dose-dependent manner, knowing the unperturbed cell state. By enforcing a latent space consistent with cumulative dose effects, DORA matches other methods at predicting transcriptomes and substantially outperforms existing latent representations at viability prediction. The transcriptome-viability relationship provided by the model further allows the recovery of known biomarkers of cell viability while suggesting novel ones. Overall, DORA provides a unified framework delivering actionable biological insights for phenotypic drug screening and personalized medicine.

13

A geometric atlas of how ESM3 organizes modalities across depth

Steenwyk, J. L.

2026-07-12 bioinformatics 10.64898/2026.07.08.737319 medRxiv

Top 0.1%

19.6%

Protein language models learn general-purpose representations from large collections of protein sequences and structures, and have advanced the prediction of protein structure and function. ESM3 is a multimodal protein language model that ingests a protein through several channels at once, including amino-acid sequence, three-dimensional structure, secondary structure (SS8), solvent accessibility (SASA), and discrete functional annotations, summing their embeddings into a single residual stream. Little is known about whether these modalities occupy separate subspaces and the depth at which they fuse. The present analysis examines ESM3 (esm3-sm-open-v1; 1.4 billion parameters; 48 transformer layers) once per modality in isolation and applies representational-similarity analysis across all 48 layers. The four physical modalities (sequence, structure, SS8, SASA) begin in distinct subspaces, remain maximally separated through roughly the first half of layers, and then fuse into a shared low-dimensional subspace between layers 25 and 35. The fusion is ordered. The structure-derived modalities (structure, SS8, SASA) are mutually aligned from the input, whereas sequence joins last, after layer 28. The functional-annotation modality never fuses; instead, it remains representationally orthogonal to the physical modalities at every layer, and this orthogonality holds whether the annotation is supplied as whole-protein or per-residue, suggesting that it is content-driven rather than a tokenization artifact. The fusion is a learned property, absent in a randomly initialized model of the same architecture, holds at the residue level below the mean-pool, and reorganizes variance, converting between-condition variance into within-condition variance while the stream never approaches isotropy. Fusion depth is independent of protein length but is delayed by structural disorder. The phenomenon is universal across diverse organisms.
Across 5,555 proteins from 12 organisms spanning eukaryota, bacteria, and archaea, every superkingdom (and every individual organism) reaches peak modality fusion at the same network depth (layer 35).
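
Representational-similarity analysis, as named in this abstract, can be sketched in a few lines of Python. This is a generic illustration rather than the author's pipeline: it assumes cosine distance for the dissimilarity matrices and Pearson correlation for comparing them, and the function names and toy vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def rdm_upper(reps):
    """Upper triangle of the representational dissimilarity matrix."""
    n = len(reps)
    return [cosine_distance(reps[i], reps[j])
            for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rsa_score(reps_a, reps_b):
    """Compare two representations of the same items via their RDMs."""
    return pearson(rdm_upper(reps_a), rdm_upper(reps_b))

# Toy per-protein vectors for two "modalities" (invented for illustration).
modality_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
modality_b = [[3.0 * x for x in vec] for vec in modality_a]  # rescaled copy
score = rsa_score(modality_a, modality_b)
```

Because cosine distance ignores vector length, the rescaled copy yields the same dissimilarity structure and a score of essentially 1. In a layer-wise analysis, a score like this would be computed for each pair of modalities at each layer.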

14

Learning Perturbation Effects Through Contrastive Alignment of Multimodal Biological Embeddings

Long, W.; Liu, T.; Szalata, A.; Theis, F. J.; Xue, L.; Zhao, H.

2026-06-26 bioinformatics 10.64898/2026.06.23.734145 medRxiv

Top 0.1%

18.8%

Multimodal single-cell perturbation screens offer a scalable approach for characterizing the effects of genetic and chemical interventions on cellular state. However, most existing representation learning methods are tailored to a single perturbation modality and fail to explicitly incorporate external semantic knowledge, which limits their ability to generalize across datasets and perturbation types. Here, we introduce PertOmni, a CLIP-style multimodal representation learning framework that aligns transcriptomic perturbation signatures with text-derived embeddings of curated genes and compound descriptions, as well as image-derived embeddings from cell paintings. PertOmni jointly trains a shared transcriptomic encoder and dataset-specific text encoders using a masked contrastive objective that emphasizes within-cell-type discrimination while mitigating confounding effects arising from cell-type heterogeneity. We evaluate the produced joint embedding space on bidirectional retrieval, drug-gene interaction inference, and perturbation prediction across both small-molecule and CRISPRi perturbation datasets, and demonstrate consistent improvements over strong baseline methods.

15

Predicting host-pathogen interactions using a proteome-scale language model

Malbranke, C.; Fruet, C.; Bitbol, A.-F.

2026-05-31 bioinformatics 10.64898/2026.05.29.728699 medRxiv

Top 0.1%

18.6%

ProteomeLM (Malbranke et al., 2025) is a proteome-scale language model trained on proteomes spanning the tree of life to reconstruct masked protein embeddings from proteome context within each species. Its attention coefficients capture protein-protein interactions without supervision. Here, we show that this capability extends to cross-species host-pathogen interactions (HPI) across ten human pathogen taxa spanning viruses and bacteria, and can be further improved with lightweight fine-tuning. We introduce ProteomeLM-HPI, a parameter-efficient adaptation via LoRA, trained on concatenated host-pathogen proteomes to reconstruct masked pathogen embeddings from host context. ProteomeLM-HPI involves two key design choices: asymmetric masking (pathogen-heavy masking) and blocked self-attention. Systematic ablations show that both choices contribute. To assess generalization, we introduce a strict cross-species benchmark enforcing pathogen-level hold-out and 40% sequence-identity filtering. On this benchmark, ProteomeLM-HPI improves AUC on 9 out of 10 unseen pathogens.

16

Improving Variant Effect Prediction by Steering Sparse Mechanistic Features in Protein Language Models

Wang, M.; Yuan, M.; Vasilakos, A. V.; He, Y.; Ren, Z.

2026-05-15 bioinformatics 10.64898/2026.05.12.724472 medRxiv

Top 0.1%

18.6%

Protein language models (PLMs) like the ESM series encapsulate immense evolutionary knowledge within their high-dimensional continuous embeddings. However, these latent representations are densely entangled, obscuring the fine-grained biophysical constraints necessary for precise functional resolution. To unlock the full expressive power of these embeddings, we propose PLM-SAE, a mechanistic framework that employs Sparse Autoencoders (SAEs) to disentangle PLM representations into discrete, biologically interpretable activations. By isolating and directly intervening on critical functional features, we fundamentally enhance the structural and mutational awareness of the underlying embeddings. We rigorously validate this embedding enhancement on variant effect prediction (VEP). In the unsupervised zero-shot setting, our sparse modulation elevates the state-of-the-art ESM-3 model, yielding performance improvements across 114 deep mutational scanning datasets and delivering an 80.8% relative improvement on challenging targets like the human E3 ubiquitin ligase HECD1. Furthermore, our target-specific differentiable gating mechanism achieves consistent performance gains in over 80% of evaluated datasets with an average Spearman ρ increase of +0.138. Finally, extending this approach to a cross-fitness multitask architecture establishes new state-of-the-art results on 17 VenusMutHub datasets, highlighted by a 169.0% performance surge in small-molecule binding predictions. Our work demonstrates that refining the highly entangled latent manifold via sparse modulation provides a robust and generalizable foundation for enhancing downstream PLM capabilities.

17

SRSA-VAE: Self-Attention-Based Feature Learning for Single-Cell Multimodal Clustering

Das, R.; Dey, A.; Maulik, U.; Bandyopadhyay, S.

2026-05-11 bioinformatics 10.64898/2026.05.06.723212 medRxiv

Top 0.1%

18.5%

Clustering plays a critical role in the analysis of single-cell omics data for identifying cellular heterogeneity and uncovering biological mechanisms. However, the high dimensionality, sparsity, and multimodal nature of single-cell datasets such as single-cell RNA sequencing (scRNA-seq) and Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) pose significant challenges for effective feature learning and representation learning. Traditional dimensionality reduction methods often rely on linear transformations and fail to capture complex nonlinear relationships between gene and protein expression profiles. In this work, we propose SRSA-VAE, a scalable variational autoencoder framework that integrates a residual self-attention encoder for context-aware feature learning and multimodal representation learning. The proposed model dynamically contextualizes gene and protein representations through a self-attention mechanism, enabling the encoder to capture inter-cell relationships and emphasize biologically informative signals. A scalable residual connection further stabilizes training and preserves essential input information during latent representation learning. We evaluate SRSA-VAE on five large-scale publicly available single-cell datasets, including both scRNA-seq and CITE-seq data, and compare its performance with established deep generative models. Experimental results demonstrate that SRSA-VAE consistently outperforms existing methods in Adjusted Rand Index (ARI) across benchmark datasets, with particularly strong gains on complex immune cell populations. Ablation studies further confirm the importance of the self-attention mechanism and residual connection in enhancing model stability and clustering accuracy. The proposed model offers a generalizable, robust, and scalable solution for single-cell clustering tasks. Code repository: https://github.com/rangan2510/srsa-vae

18

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

Olson, M. L.; Yu, M.

2026-06-11 bioinformatics 10.64898/2026.06.08.730928 medRxiv

Top 0.1%

18.3%

Generative models have shown remarkable progress in a variety of domains such as protein design, but such power enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains sparse autoencoders (SAEs) on diffusion transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space than in the original models' representations, improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic features from the SAE that fire only on hazardous designs at up to AUROC 0.84 (q < 10^-13). To our knowledge, this is the first SAE trained on an all-atom diffusion model and the first feature-level virulence audit of a protein design model, paving the way towards safe and interpretable protein design.
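The per-feature AUROC figure above can be read as follows: a single SAE feature's activation is treated as a score, and AUROC is the probability that a hazardous design scores higher than a non-hazardous one. The sketch below shows that standard pairwise definition. The function name `feature_auroc` and the toy activations are assumptions for illustration, not values or code from the preprint.

```python
def feature_auroc(activations, is_hazardous):
    """AUROC of one feature's activation as a hazard score.

    Uses the pairwise (Mann-Whitney) definition; ties count as half a win.
    Illustrative helper (hypothetical name), not the authors' evaluation code.
    """
    pos = [a for a, y in zip(activations, is_hazardous) if y]
    neg = [a for a, y in zip(activations, is_hazardous) if not y]
    if not pos or not neg:
        raise ValueError("need at least one hazardous and one benign design")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy activations for four designs: two hazardous, two benign.
acts = [0.9, 0.2, 0.8, 0.1]
labels = [True, True, False, False]
print(feature_auroc(acts, labels))  # 0.75: three of four pairs ranked correctly
```

The quadratic loop is fine for a sketch; a rank-based implementation would be preferable for large design sets.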

19

Generalizing intensive care AI across time scales in resource-limited settings

Devadiga, A.; Singh, P.; Sankar, J.; Lodha, R.; Sethi, T.

2026-04-24 health informatics 10.64898/2026.04.23.26351588 medRxiv

Top 0.1%

18.3%

Temporal resolution of physiological monitoring in intensive care varies widely across healthcare systems. Artificial intelligence models assume a uniform and fixed frequency of sampling, thus limiting the generalizability of models, especially to resource-limited settings. Here, we propose a novel resolution-transfer task for physiological time series and ask whether models trained on high-resolution data can generalize to a low data-density setting without the need to retrain them. SafeICU, a novel longitudinal pediatric intensive care dataset spanning ten years from a tertiary care hospital in India, was used to test this hypothesis. Self-supervised transformer models were trained on 144,271 patient-hours of high-resolution physiological signals from 984 pediatric ICU stays to learn representations of heart rate, respiratory rate, oxygen saturation, and arterial blood pressure. Transfer of this model to low-resolution data established robust performance in clinically relevant lower-frequency intervals, consistently outperforming models trained directly at coarser resolutions. Further, these representations generalized across patient populations, maintaining performance when evaluated on adult intensive care cohorts from the MIMIC-III and eICU databases without retraining. In a downstream task of early shock prediction, models achieved strong discrimination in the pediatric cohort (area under the receiver operating characteristic curve (AUROC) 0.87; area under the precision-recall curve (AUPRC) 0.92) and retained stable performance across monitoring intervals from 10 to 60 minutes (AUROC 0.78-0.88). Together, these results demonstrate that physiological representations learned from high-resolution data enable time-scale-robust and transferable AI for intensive care. 
The publicly released SafeICU dataset, comprising longitudinal vital signs, laboratory measurements, treatment records, microbiology, and admission and discharge, provides a foundation for developing and deploying generalizable clinical AI in resource-limited settings.
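To make the resolution-transfer setting concrete, the sketch below coarsens a high-resolution vital-sign series into longer monitoring intervals by averaging non-overlapping windows. The abstract does not state how lower-resolution inputs were derived, so window averaging, the function name `downsample_mean`, and the toy heart-rate values are all assumptions for illustration.

```python
def downsample_mean(samples, window):
    """Average non-overlapping windows of a vital-sign series.

    `samples` may contain None for missing readings; a window with no
    valid readings yields None. Window averaging is an assumption here,
    not the preprint's documented preprocessing.
    """
    if window <= 0:
        raise ValueError("window must be a positive integer")

    coarse = []
    for start in range(0, len(samples), window):
        valid = [v for v in samples[start:start + window] if v is not None]
        coarse.append(sum(valid) / len(valid) if valid else None)
    return coarse


# Hypothetical heart-rate readings at a fine interval, with one gap.
heart_rate = [100, 102, None, 98, 110, 112]
print(downsample_mean(heart_rate, 2))  # [101.0, 98.0, 111.0]
```

A model trained on the fine-grained series would then be evaluated on the coarse output without retraining, which is the generalization question the abstract poses.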

20

Dissecting and directing pathology foundation models

Kim, C.; Kaczmarzyk, J.; Savant, D.; Zhao, Z.; Koo, P.; Lee, S.-I.

2026-06-16 pathology 10.64898/2026.06.12.731496 medRxiv

Top 0.1%

18.3%

Foundation models (FMs) are central to digital pathology, encoding histology images into dense embeddings for facilitating diagnostic classification, molecular alteration prediction, and clinical outcome modeling. However, the opacity of these embeddings renders FM-based systems "black boxes," limiting their trustworthiness for clinical translation and utility for scientific discovery. Here, we introduce PICASSO (Pathology Image Concept Atlas built via SparSe dictiOnary learning), a framework that makes pathology FMs interpretable and controllable. PICASSO decomposes FM embeddings into human-interpretable visual concepts using a sparse autoencoder. It is trained on more than 120 million tissue patches across 32 cancer types, producing the first pan-cancer atlas of histomorphological concepts. We demonstrate that PICASSO enables diverse downstream applications of FM embeddings by exposing interpretable structure within learned representations and supporting concept-level intervention. It enables auditing of clinical model behavior by revealing the morphological features driving predictions. Beyond transparency and validation, PICASSO enables the discovery of new biological insights; for example, it identified hobnailing epithelial morphology as a previously unrecognized biomarker of EGFR mutations in lung adenocarcinoma. By linking PICASSO-derived concepts with spatial transcriptomics, we uncover associations between morphological patterns and gene expression programs. Furthermore, PICASSO allows suppression of concepts associated with technical artifacts, thereby reducing model reliance on spurious signals. Finally, PICASSO enables controlled manipulation of learned concepts to generate counterfactual embeddings for exploratory therapeutic analysis, such as modulating tumour-infiltrating lymphocyte density to assess impacts on predicted survival outcomes.
Together, PICASSO provides a principled framework for transforming pathology FMs into platforms for mechanistic insight and discovery.
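The concept-level intervention described above can be sketched with a generic sparse autoencoder: encode an embedding into non-negative concept activations, rescale selected concepts (zero to suppress), and decode back to a counterfactual embedding. This is a minimal illustration under assumed conventions (ReLU encoder, linear decoder, toy identity weights); the names `encode`, `decode`, and `scale_concepts` are hypothetical and do not describe PICASSO's actual architecture or API.

```python
def encode(x, w_enc, b_enc):
    """ReLU encoder: one non-negative activation per concept."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def decode(z, w_dec, b_dec):
    """Linear decoder: bias plus activation-weighted sum of concept atoms."""
    out = list(b_dec)
    for zi, atom in zip(z, w_dec):
        for k, a in enumerate(atom):
            out[k] += zi * a
    return out


def scale_concepts(z, scales):
    """Multiply chosen concept activations; a scale of 0.0 suppresses one."""
    return [v * scales.get(i, 1.0) for i, v in enumerate(z)]


# Toy 2-dimensional embedding and 2 concepts with identity weights (assumed).
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0]]  # one row per concept atom
b_dec = [0.0, 0.0]

embedding = [2.0, 3.0]
z = encode(embedding, w_enc, b_enc)
# Suppress concept 1, e.g. one judged to reflect a technical artifact.
counterfactual = decode(scale_concepts(z, {1: 0.0}), w_dec, b_dec)
print(counterfactual)  # [2.0, 0.0]
```

Scales other than zero (for example 2.0) amplify a concept instead, which corresponds to the modulation use case rather than suppression.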